Improving Exact Search of Multiple Patterns From a Compressed Suffix Array

Author

  • Kalle Karhu
Abstract

Self-indexes are widely studied and widely applied structures in string matching. However, exact matching of multiple patterns using self-indexes has received little concentrated study, even though it may have direct and indirect applications in fields such as bioinformatics. This paper presents a method for improving the exact search of multiple patterns from a compressed suffix array. The proposed method cuts run times for the handled patterns by as much as 71.6%. A set of 1000 patterns, each 1000 nucleotides long, is found in a 50 MB text 14.0% faster than by searching for the patterns with the locate functionality of the compressed suffix array.
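For orientation, the baseline that the abstract compares against (running a locate query once per pattern) can be sketched as follows. This is only an illustration under loud assumptions: it uses a plain, uncompressed suffix array built naively instead of a compressed suffix array, and the function names build_suffix_array, locate, and locate_all are hypothetical. It is not the method proposed in the paper.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a small illustration; real indexes use far faster builders.
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate(text, sa, pattern):
    # Return all start positions of `pattern` in `text`, in increasing order.
    # Suffixes are sorted, so their length-m prefixes are non-decreasing and
    # the matching suffixes form one contiguous range of the suffix array.
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def locate_all(text, patterns):
    # Baseline for multiple patterns: one independent locate per pattern.
    sa = build_suffix_array(text)
    return {p: locate(text, sa, p) for p in patterns}


if __name__ == "__main__":
    print(locate_all("ACGTACGTAC", ["ACG", "GTA", "TTT"]))
```

Each pattern here is searched independently of the others, which is the per-pattern cost that a multi-pattern method would try to reduce.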


Similar resources

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes

Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed inde...


A Modified Burrows-Wheeler Transformation for Case-Insensitive Search with Application to Suffix Array Compression

Now the Block sorting compression [1] becomes common by its good balance of compression ratio and speed. It has another nice feature, which is the relation between encoding/decoding process and suffix array. The suffix array [2] is a memory-efficient data structure for searching any substring of a text. It is an array of lexicographically sorted pointers to suffixes of a text. It is also used f...


Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences

Searching patterns in the DNA sequence is an important step in biological research. To speed up the search process, one can index the DNA sequence. However, classical indexing data structures like suffix trees and suffix arrays are not feasible for indexing DNA sequences due to main memory requirement, as DNA sequences can be very long. In this paper, we evaluate the performance of two compress...


essaMEM: finding maximal exact matches using enhanced sparse suffix arrays

We have developed essaMEM, a tool for finding maximal exact matches that can be used in genome comparison and read mapping. essaMEM enhances an existing sparse suffix array implementation with a sparse child array. Tests indicate that the enhanced algorithm for finding maximal exact matches is much faster, while maintaining the same memory footprint. In this way, sparse suffix arrays remain com...


Suffix arrays: what are they good for?

Recently the theoretical community has displayed a flurry of interest in suffix arrays, and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can ind...




Publication date: 2011